Lexical Gaps and Lexicalization: Implications for Word Segmentation Systems for Chinese NLP

Author

  • Chan-Chia Hsu
Abstract

This paper is motivated by the observation that not all adjectives in Chinese have a canonical antonym. For example, most Chinese speakers translate the English word dishonest as the word string bu chengshi ‘not honest’ rather than as any of the antonym candidates for chengshi suggested in antonym dictionaries. Discourse evidence from corpus data suggests that bu chengshi is evolving into a word at a faster pace than some other ‘bu + adjective’ strings, which may result from the lexical gap for a canonical antonym of chengshi and the communicative need for such a word. It is therefore proposed that, if the lexicalization of bu chengshi continues, the string may need to be treated as a single word in a segmentation system (i.e., buchengshi ‘dishonest’). For a segmentation system to distinguish between words and phrases, discourse factors should be taken into consideration.
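As a rough illustration of what treating the string as a single word would mean for a segmenter, the sketch below uses greedy forward maximum matching over a toy lexicon. This is not the paper's method, which the abstract does not describe. The function name `forward_max_match`, the toy lexicons, the example sentence, and the use of the characters 不诚实 for bu chengshi are all assumptions made for illustration only.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching: at each position, take the
    longest lexicon entry that starts there."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when no lexicon entry matches.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicons: 不 = bu 'not', 诚实 = chengshi 'honest'.
base_lexicon = {"他", "不", "诚实"}
# Same lexicon once the lexicalized string is listed as one entry.
extended_lexicon = base_lexicon | {"不诚实"}

sentence = "他不诚实"  # ta bu chengshi, 'he is not honest / he is dishonest'
print(forward_max_match(sentence, base_lexicon))      # ['他', '不', '诚实']
print(forward_max_match(sentence, extended_lexicon))  # ['他', '不诚实']
```

In this sketch the only difference between the two outputs is the lexicon entry. Deciding when a string such as bu chengshi deserves that entry is the question the abstract argues should be informed by discourse factors.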


Related resources

First Language Activation during Second Language Lexical Processing in a Sentential Context

Lexicalization patterns, the way words are mapped onto concepts, differ from one language to another. This study investigated the influence of first language (L1) lexicalization patterns on the processing of second language (L2) words in sentential contexts by both less proficient and more proficient Persian learners of English. The focus was on cases where two different senses of a polys...


The Role of Lexical Resources in CJK Natural Language Processing

The role of lexical resources is often understated in NLP research. The complexity of Chinese, Japanese and Korean (CJK) poses special challenges to developers of NLP tools, especially in the area of word segmentation (WS), information retrieval (IR), named entity extraction (NER), and machine translation (MT). These difficulties are exacerbated by the lack of comprehensive lexical resources, e...


The Contribution of Lexical Resources to Natural Language Processing of CJK Languages

The role of lexical resources is often understated in NLP research. The complexity of Chinese, Japanese and Korean (CJK) poses special challenges to developers of NLP tools, especially in the area of word segmentation (WS), information retrieval (IR), named entity extraction (NER), and machine translation (MT). These difficulties are exacerbated by the lack of comprehensive lexical resources, e...


The Impact of Metalinguistic English Vocabulary Knowledge and Lexical Inferencing on EFL Learners’ Lexical Knowledge Considering the Cross-Linguistic Issue of L1 Lexicalization

The present study endeavors to unravel the enigma of the psycholinguistic mechanisms underpinning bilingual mental lexicon by analyzing the issue of L1 lexicalization as a construct epitomizing an overarching framework. It involves 78 juniors at the Islamic Azad University, Roudehen Branch. The study inspects the impact of the interventionist/noninterventionist treatments on both sets of lexica...


Normalized Accessor Variety Combined with Conditional Random Fields in Chinese Word Segmentation

The word is the basic unit in natural language processing (NLP), as it is at the lexical level upon which further processing rests. The lack of word delimiters such as spaces in Chinese texts makes Chinese word segmentation (CWS) an interesting yet challenging issue. This paper describes the in-depth research following our participation in the fourth International Chinese Language Processing ...




Publication date: 2012